CIS545: Final Project - CORD-19 Dataset

Justin Choi

TA: Hoyt Gong

Hi there! For this project, I opted to use the COVID-19 research challenge dataset, which they named "CORD-19" (I promise, the title wasn't a typo haha). The dataset of articles was created by the Allen Institute for AI, Chan Zuckerberg Initiative, Microsoft Research, NIH, and more; if you wanna check it out for yourself, you can either download it or check out the Kaggle webpage for it!

In [1]:
import pandas as pd 
import numpy as np
import matplotlib.pyplot as plt
import glob, os
import json

The Data

Thanks to all the ~ useful ~ skills we've picked up over the course of this semester, we'll start with everyone's favorite tedious, time-consuming task - data cleaning! woo hoooooOOOOooOo who doesn't lüv missing values and weird formatting.

First, we'll be utilizing the built-in json, os, and glob modules from Python to get each of the files from our directory and then extract the right text! From there, we'll be using our best friend pandas in order to aggregate all this text data into one dataframe along with it's associated metadata from the metadata.csv:

In [2]:
metadata_df = pd.read_csv('./CORD-19-research-challenge/metadata.csv')

# Import all the json files
cord_19_folder = './CORD-19-research-challenge/'
list_of_files = []; # only going to take those from pdf_json! not pmc_json
for root, dirs, files in os.walk(cord_19_folder):
    for name in files:
        if (name.endswith('.json')):
            full_path = os.path.join(root, name)
            list_of_files.append(full_path)
sorted(list_of_files)
print('done')

# ALTERNATE

# all_json = glob.glob(f'{cord_19_folder}/**/*.json', recursive=True)
# len(all_json)
done
In [ ]:
class JsonReader:
    def __init__(self, file_path):
        with open(file_path) as file: 
            content = json.load(file)
            # start to insert body text 
            self.paper_id = content['paper_id']
            self.body_text = [] 
            for entry in content['body_text']:
                self.body_text.append(entry['text'])
            self.body_text = '\n'.join(self.body_text)
    def __repr__(self):
        return f'{self.paper_id}: {self.body_text[:500]}...'

random_json = list_of_files[47404]
sample_article = JsonReader(random_json)
print(sample_article)

Now that we've extracted all the necessary text data, let's load it onto a dataframe and clean it out so that we don't have any null values in important places (e.g. title, body_text, etc)

In [4]:
input = {'paper_id': [], 'doi':[],  'title': [], 'abstract': [], 'body_text': [], 'authors': [], 'journal': []}

for i, entry in enumerate(list_of_files):
    if i % (len(list_of_files) // 25) ==    0:
        print(f'Processing {i} of {len(list_of_files)}')
    try: 
        article = JsonReader(entry)
    except Exception as e: 
        continue #means that we don't have a valid file format  
    
    metadata = metadata_df.loc[metadata_df['sha'] == article.paper_id]
    if len(metadata) == 0:
        continue # no such metadata for paper in our csv, skip

    input['body_text'].append(article.body_text)
    input['paper_id'].append(article.paper_id)

    # add in metadata 
    title = metadata['title'].values[0] 
    doi = metadata['doi'].values[0] 
    abstract = metadata['abstract'].values[0] 
    authors = metadata['authors'].values[0] 
    journal = metadata['journal'].values[0] 

    input['title'].append(title)
    input['doi'].append(doi)
    input['abstract'].append(abstract)
    input['authors'].append(authors)
    input['journal'].append(journal)
Processing 0 of 59311
Processing 2372 of 59311
Processing 4744 of 59311
Processing 7116 of 59311
Processing 9488 of 59311
Processing 11860 of 59311
Processing 14232 of 59311
Processing 16604 of 59311
Processing 18976 of 59311
Processing 21348 of 59311
Processing 23720 of 59311
Processing 26092 of 59311
Processing 28464 of 59311
Processing 30836 of 59311
Processing 33208 of 59311
Processing 35580 of 59311
Processing 37952 of 59311
Processing 40324 of 59311
Processing 42696 of 59311
Processing 45068 of 59311
Processing 47440 of 59311
Processing 49812 of 59311
Processing 52184 of 59311
Processing 54556 of 59311
Processing 56928 of 59311
Processing 59300 of 59311
In [5]:
covid_df = pd.DataFrame(input, columns=['paper_id', 'doi', 'title', 'abstract', 'body_text', 'authors', 'journal'])
print('finished creating dataframe from input dictionary')
rows, cols = covid_df.shape
print(f'number of rows: {rows}')
covid_df.info()
finished creating dataframe from input dictionary
number of rows: 36009
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 36009 entries, 0 to 36008
Data columns (total 7 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   paper_id   36009 non-null  object
 1   doi        35672 non-null  object
 2   title      35973 non-null  object
 3   abstract   31675 non-null  object
 4   body_text  36009 non-null  object
 5   authors    35413 non-null  object
 6   journal    34277 non-null  object
dtypes: object(7)
memory usage: 1.9+ MB
In [6]:
covid_df.dropna(inplace=True)
print('finished dropping articles with null abstracts/body text/titles')
rows, cols = covid_df.shape
print(f'number of rows: {rows}')
finished dropping articles with null abstracts/body text/titles
number of rows: 29600
In [7]:
covid_df['body_word_count'] = covid_df['body_text'].apply(lambda x : len(x.strip().split()))
covid_df['body_unique_count'] = covid_df['body_text'].apply(lambda x : len(set(x.strip().split())))
In [8]:
# visualiation check to see if data is finished being cleaned 
covid_df.head()
Out[8]:
paper_id doi title abstract body_text authors journal body_word_count body_unique_count
0 4ed70c27f14b7f9e6219fe605eae2b21a229f23c 10.1080/14787210.2017.1271712 Update on therapeutic options for Middle East ... Introduction: The Middle East Respiratory Synd... The Middle East respiratory syndrome coronavir... Al-Tawfiq, Jaffar A.; Memish, Ziad A. Expert Rev Anti Infect Ther 2748 996
1 306ef95a3a91e13a93bcc37fb2c509b67c0b5640 10.1093/cid/ciaa256 A Novel Approach for a Novel Pathogen: using a... Thousands of people in the United States have ... The 2019 novel coronavirus (SARS-CoV-2), ident... Bryson-Cahn, Chloe; Duchin, Jeffrey; Makarewic... Clin Infect Dis 944 486
2 ab680d5dbc4f51252da3473109a7885dd6b5eb6f 10.1016/b978-0-12-800049-6.00293-6 Evolutionary Medicine IV. Evolution and Emerge... Abstract This article discusses how evolutiona... The evolutionary history of humans is characte... Scarpino, S.V. Encyclopedia of Evolutionary Biology 2884 1091
3 6599ebbef3d868afac9daa4f80fa075675cf03bc 10.1016/j.enpol.2008.08.029 International aviation emissions to 2025: Can ... Abstract International aviation is growing rap... Sixty years ago, civil aviation was an infant ... Macintosh, Andrew; Wallace, Lailey Energy Policy 5838 1587
5 44290ff75bad8ffaf5d3028420739ce7b08dc2a9 10.1093/jac/dkp502 Inhibition of enterovirus 71 replication and t... OBJECTIVES: Enterovirus 71 (EV71) causes serio... Enteroviruses are members of the family Picorn... Hung, Hui-Chen; Chen, Tzu-Chun; Fang, Ming-Yu;... J Antimicrob Chemother 3121 1064
In [9]:
covid_df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 29600 entries, 0 to 36008
Data columns (total 9 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   paper_id           29600 non-null  object
 1   doi                29600 non-null  object
 2   title              29600 non-null  object
 3   abstract           29600 non-null  object
 4   body_text          29600 non-null  object
 5   authors            29600 non-null  object
 6   journal            29600 non-null  object
 7   body_word_count    29600 non-null  int64 
 8   body_unique_count  29600 non-null  int64 
dtypes: int64(2), object(7)
memory usage: 2.3+ MB
In [10]:
#check to see if there are duplicates
covid_df['abstract'].describe()
Out[10]:
count       29600
unique      29543
top       Unknown
freq           21
Name: abstract, dtype: object
In [11]:
covid_df.drop_duplicates(subset=['abstract', 'body_text'], inplace=True)
In [12]:
covid_df.describe()
Out[12]:
body_word_count body_unique_count
count 29597.000000 29597.000000
mean 4560.325878 1425.199007
std 3528.817632 748.579044
min 2.000000 2.000000
25% 2705.000000 989.000000
50% 3847.000000 1288.000000
75% 5533.000000 1695.000000
max 171948.000000 25156.000000

Yayyy go cleaned data! Now that we have a cleaned out dataframe, one of the first things we want to do is find the language of each of these research articles, as not all of them are in English! We'll do some EDA on this in a bit, but for the rest of the project we're going to filter out any articles that aren't in English, just so we can simplify our modeling later down the line:

In [13]:
from langdetect import detect
from tqdm import tqdm

languages = [] # make list that you can port directly into covid_df as column

for i in tqdm(range(len(covid_df))):
    row_text = covid_df.iloc[i]['body_text'].split(" ") 
    lang = 'en' # set default lang to be english 

    # try to just use the intro 25 words to detect language
    try:
        if (len(row_text)) > 125: 
            lang = detect(" ".join(row_text[:125]))
        elif(len(row_text)) > 0:
            lang = detect(" ".join(row_text))
    except Exception as e: # if body doesn't work, let's try abstract
        try: 
            lang = detect(covid_df.iloc(i)['abstract'].split())
        except Exception as e:
            lang = 'dunno'
            continue
    finally:
        languages.append(lang)
        
100%|██████████| 29597/29597 [02:54<00:00, 169.76it/s]
In [14]:
lang_array = np.asarray(languages)
covid_df['language'] = lang_array
lang_dict = {}
for language in lang_array:
    if language in lang_dict:
        lang_dict[language] += 1
    else: 
        lang_dict[language] = 1
lang_dict
Out[14]:
{'en': 29045,
 'fr': 246,
 'nl': 36,
 'es': 171,
 'it': 14,
 'de': 67,
 'pt': 12,
 'cy': 2,
 'pl': 2,
 'dunno': 1,
 'zh-cn': 1}
In [15]:
# test to see if languages were detected correctly
covid_df[covid_df['language'] == 'nl'].head(10)
Out[15]:
paper_id doi title abstract body_text authors journal body_word_count body_unique_count language
56 7d1d1ee1fcc8f713f374f3d693861913a2730ac4 10.1007/978-90-313-6316-2_5 5 Zorgvuldig en verantwoord werken Verzorgenden in verpleeg- en verzorgingshuizen... Verzorgenden in verpleeg-en verzorgingshuizen ... Dito, J.C.; Stavast, T.; Zwart, D.E. Basiszorg Boek 1 3808 1563 nl
390 74276b4789e8384757bce159941cef1e3920f312 10.1007/978-90-368-1813-1_2 Ademhaling Dyspneu is een subjectieve sensatie van ineffi... Klachten van het ademhalingsapparaat vormen va... Huybrechts, B. P. M. Leerboek spoedeisende-hulp-verpleegkunde 14084 3991 nl
2020 8623787ac0fead8e9a536807c547de64db135ec9 10.1007/978-90-313-7944-6_2 2 Infecties van de bovenste luchtwegen Luchtweginfecties in het algemeen en infecties... Luchtweginfecties in het algemeen en infecties... de Jong, M.D.; Wolfs, T.F.W. Microbiologie en infectieziekten 5965 2143 nl
2351 fa2318e3d0c2345a58e8380e1fa731beb90ef808 10.1007/978-90-368-1230-6_11 Openbare gezondheid en preventie In dit hoofdstuk staat de wetgeving op het geb... sche) maatregelen om infectieziekten te voorko... van Noord, T. J. C. Recht en gezondheidszorg 31722 6949 nl
2840 9ce7d47fb8fc1f1391d59815ec1f49abe3198640 10.1007/978-90-368-1629-8_10 Infectieziekten Infectie ontstaat bij een stoornis in de inter... therapie en preventie infectie gastheer micro-... Kullberg, B. J.; van der Meer, J. W. M.; Warri... Codex Medicus 12729 4552 nl
6189 8729e026c64a1670121c7015a3b8e5038e7a16fd 10.1007/978-90-368-0945-0_15 Importziekten Importziekten komen voor bij reizigers en migr... Uit deze driehoek komt een risicoschatting voo... Overbosch, D.; van Genderen, P.J.J. Differenti&#x000eb;le diagnostiek in de intern... 7418 2512 nl
6258 56ad43b80c526a6d254fd8f437f11de42db69584 10.1007/978-90-368-1320-4_1 1 Inleiding In augustus 2011 wijdde de Journal of the Amer... In augustus 20ıı wijdde de Journal of the Amer... Mackenbach, J.P.; Stronks, K. Volksgezondheid en gezondheidszorg 7296 2399 nl
6512 ddd4c2cf558b7512c19129c4e2af19d76cc4cf1d 10.1007/978-90-313-7944-6_18 18 Zoönosen Een zoönose is een ziekte die van dier op mens... Een zoönose is een ziekte die van dier op mens... Kortbeek, L.M.; de Vries, P.J.; Langelaar, M. Microbiologie en infectieziekten 8748 2893 nl
6924 6b85f3551b9c7e9a38b371f8aeb7e3e1dc254e6c 10.1007/978-90-368-1442-3_7 Vlekjes Mijn dochtertje (groep I) komt thuis met het b... . In Nederland doen epidemieën zich ongeveer o... Abraham-Inpijn, Luzi Tandarts in de knel 2092 967 nl
7274 6822e97a6d9a0ab5d9cf659281d70a5bad2b2e45 10.1007/978-90-368-1629-8_4 Aandoeningen van ademhalingsstelsel, mediastin... Aet. Holtevorming in een ontstekingsproces dat... chronische gevallen nog duidelijker wegens de ... Decramer, M. L. A.; Van Schil, P. E. Y.; Vanst... Codex Medicus 15245 5666 nl
In [16]:
which_language = {
    'en': 'English',
    'es': 'Spanish',
    'fr': 'French', 
    'it': 'Italian', 
    'zh-cn': 'Chinese', 
    'pl': 'Polish', 
    'cy': 'Welsh', 
    'de': 'German', 
    'pt': 'Portugese', 
    'nl': 'Dutch', 
    'dunno': 'Unknown'
}
In [17]:
test = covid_df['language'].apply(lambda x : which_language[x])
covid_df['language'] = test
covid_df['language']
Out[17]:
0        English
1        English
2        English
3        English
5        English
          ...   
36004    English
36005    English
36006    English
36007    English
36008    English
Name: language, Length: 29597, dtype: object

EDA

Now, we can start exploring the cleaned dataset! Since we just extracted the languages of each of the text, let's just see how they're distributed as a warm-up to our visualizations:

In [18]:
import seaborn as sns

Language Distribution

For the rest of the visualization throughout this project, I'll be using Seaborn over matplotlib thanks to its additional nice features as well as ~aesthetic~ appeal, so if you have anything you want to refer to, you can always check the documentation!

In [19]:
lang_distribution = covid_df['language'].value_counts()
lang_distribution
Out[19]:
English      29045
French         246
Spanish        171
German          67
Dutch           36
Italian         14
Portugese       12
Welsh            2
Polish           2
Unknown          1
Chinese          1
Name: language, dtype: int64

(For this language plot, I changed to across logarithmic scale so that languages other than English could actually show up on the bar plot lol)

In [21]:
sns.set()
fig, ax = plt.subplots(figsize=(15, 15))
ax.set_yscale('log')
plt.xticks(rotation=45)
sns.barplot(x=lang_distribution.index, y=lang_distribution.values, ax=ax, palette=sns.color_palette('Blues_r', len(lang_distribution)))
plt.show()
In [22]:
covid_df[covid_df['language'] == 'French']
Out[22]:
paper_id doi title abstract body_text authors journal body_word_count body_unique_count language
34 82500c03d57e67a212959c13a049dc82c24759cd 10.1016/j.patbio.2008.04.005 Métapneumovirus humain Résumé Le métapneumovirus humain (hMPV) est un... Le métapneumovirus humain (hMPV) a été découve... Freymuth, F.; Vabret, A.; Legrand, L.; Dina, J... Pathologie Biologie 3666 1353 French
40 acd84940fc5cd8e8f54efd04ab672f5afbd2d7df 10.1016/s0335-7457(96)80118-3 Infections virales et asthme Summary Rhinovirus, parainfluenza, influenza, ... Les infections respiratoires ~t rhinovirus, pa... Radermecker, M. Revue Française d'Allergologie et d'Immunologi... 2295 1029 French
107 ec96d05a4d7eb88ed151f2f8818802a4a4d8a6cd 10.1016/s0929-693x(07)80019-4 Diarrhées aiguës virales : aspects cliniques e... Abstract The molecular characterization of gas... Les nouvelles m6thodes immunologiques et mol6c... Olives, J.-P.; Mas, E. Archives de Pédiatrie 1337 643 French
298 422abeb54c5d650351bfd5f471c92cba61b440fa 10.1016/s0929-693x(97)83481-1 Épidémiologie des pneumopathies communautaires... Summary Viruses, particularly syncitial respir... Les infections respiratoires basses sont un mo... Marguet, C; Bocquel, N; Mallet, E Archives de Pédiatrie 817 470 French
347 31c3f74777c16704751dde30c4c5a9495890bff6 10.1016/s1773-035x(15)30110-6 Rôle des animaux vertébrés dans l’épidémiologi... Résumé Les zoonoses, distinguées ici des malad... Les zoonoses représentent un groupe particulie... Moutou, François Revue Francophone des Laboratoires 3499 1385 French
... ... ... ... ... ... ... ... ... ... ...
23443 b08934a33d21cad2d1fbde47f8772448980cb417 10.1007/s13546-011-0314-3 Infections respiratoires virales à herpesvirid... Herpesviridae, including herpes simplex virus ... Résumé Les herpesviridae, essentiellement l'he... Luyt, C. -E. Reanimation 3411 1035 French
23497 70d49d193252e9892acf1e69f0f8a4305c3458ca 10.1016/s1294-5501(07)73918-3 Infectiologie itinérante Résumé Objectifs Une prise de conscience s’imp... Ce titre quelque peu énigmatique est l'occasio... Rey, M. Antibiotiques 4052 1594 French
23743 60e14b8a9c26c1cfb37d95fef2c3f95bc6f2f0b7 10.1016/j.medmal.2004.09.005 Les maladies infectieuses émergentes : importa... Résumé À la fin des années 1970 on a parlé « d... À la fin des années 1970 on a parlé « de la fi... Desenclos, J.-C.; De Valk, H. Médecine et Maladies Infectieuses 8073 2450 French
23848 000eec3f1e93c3792454ac59415c928ce3a6b4ad 10.1016/j.reaurg.2004.02.009 Pneumonie virale sévère de l'immunocompétent Résumé Les infections virales respiratoires co... Les pathologies infectieuses respiratoires son... Guery, B; d'Escrivan, T; Georges, H; Legout, L... Réanimation 6102 2082 French
29382 e0833a1c57d22b36db54876afb2282e00148b691 10.1080/17290376.2017.1375426 Contemporary HIV/AIDS research: Insights from ... Knowledge management as a field is concerned w... La gestion du savoir en tant que domaine est u... Callaghan, Chris William SAHARA J 8331 2564 French

246 rows × 10 columns

Journal Contributions

Next up, let's see what journal has published the most relevant research on COVID-19 and coronaviruses within this specific dataset!

Since we can only fit so many journals onto one bar plot, we're going to limit the journals we display to those that have contributed more than 100 articles to this dataset:

In [23]:
journal_dist = covid_df['journal'].value_counts()
above_100 = journal_dist[journal_dist.values > 100]
above_100
Out[23]:
PLoS One                                               1518
Virology                                                678
Viruses                                                 548
Emerg Infect Dis                                        507
Arch Virol                                              489
Sci Rep                                                 438
Veterinary Microbiology                                 402
Virus Research                                          385
Virol J                                                 355
Journal of Virological Methods                          345
Vaccine                                                 311
PLoS Pathog                                             303
Antiviral Research                                      268
BMC Infect Dis                                          238
Front Immunol                                           211
Journal of Clinical Virology                            210
Front Microbiol                                         197
J Infect Dis                                            180
American Journal of Infection Control                   178
Clin Infect Dis                                         175
Veterinary Immunology and Immunopathology               163
BMC Vet Res                                             162
Nucleic Acids Res                                       160
Biochemical and Biophysical Research Communications     150
BMC Public Health                                       140
Influenza Other Respir Viruses                          140
The Lancet                                              131
mBio                                                    131
PLoS Negl Trop Dis                                      127
International Journal of Infectious Diseases            118
Int J Mol Sci                                           115
Infection, Genetics and Evolution                       112
Virus Genes                                             110
Molecules                                               103
Name: journal, dtype: int64

Now that we have a filtered list, let's organize it from greatest to least to see how different the contributions are from various journals:

In [24]:
fig, ax_sized = plt.subplots(figsize=(20, 20))
plt.xticks(rotation=90)
sns.barplot(x=above_100.index, y=above_100.values, ax=ax_sized, palette=sns.color_palette('Blues_r', len(above_100)))
plt.show()

Publication Date

Let's now analyze when our research papers were published! Given that COVID-19 is only a very recent species of the many other coronaviruses we know of, it'll be interesting to see how research activity has developed over the course of time, and if intuition serves us right, we'll presumably see a large spike in research publication from the past few months.

First, we're going to change the publish_time column into datetime objects so we can better work with the data!

In [25]:
from datetime import datetime

test = pd.merge(covid_df, metadata_df[['sha', 'publish_time']], left_on='paper_id', right_on='sha', how='left')
test['publish_time'] = pd.to_datetime(test['publish_time'], infer_datetime_format=True)
test
Out[25]:
paper_id doi title abstract body_text authors journal body_word_count body_unique_count language sha publish_time
0 4ed70c27f14b7f9e6219fe605eae2b21a229f23c 10.1080/14787210.2017.1271712 Update on therapeutic options for Middle East ... Introduction: The Middle East Respiratory Synd... The Middle East respiratory syndrome coronavir... Al-Tawfiq, Jaffar A.; Memish, Ziad A. Expert Rev Anti Infect Ther 2748 996 English 4ed70c27f14b7f9e6219fe605eae2b21a229f23c 2016-12-24
1 306ef95a3a91e13a93bcc37fb2c509b67c0b5640 10.1093/cid/ciaa256 A Novel Approach for a Novel Pathogen: using a... Thousands of people in the United States have ... The 2019 novel coronavirus (SARS-CoV-2), ident... Bryson-Cahn, Chloe; Duchin, Jeffrey; Makarewic... Clin Infect Dis 944 486 English 306ef95a3a91e13a93bcc37fb2c509b67c0b5640 2020-03-12
2 ab680d5dbc4f51252da3473109a7885dd6b5eb6f 10.1016/b978-0-12-800049-6.00293-6 Evolutionary Medicine IV. Evolution and Emerge... Abstract This article discusses how evolutiona... The evolutionary history of humans is characte... Scarpino, S.V. Encyclopedia of Evolutionary Biology 2884 1091 English ab680d5dbc4f51252da3473109a7885dd6b5eb6f 2016-12-31
3 6599ebbef3d868afac9daa4f80fa075675cf03bc 10.1016/j.enpol.2008.08.029 International aviation emissions to 2025: Can ... Abstract International aviation is growing rap... Sixty years ago, civil aviation was an infant ... Macintosh, Andrew; Wallace, Lailey Energy Policy 5838 1587 English 6599ebbef3d868afac9daa4f80fa075675cf03bc 2009-01-31
4 44290ff75bad8ffaf5d3028420739ce7b08dc2a9 10.1093/jac/dkp502 Inhibition of enterovirus 71 replication and t... OBJECTIVES: Enterovirus 71 (EV71) causes serio... Enteroviruses are members of the family Picorn... Hung, Hui-Chen; Chen, Tzu-Chun; Fang, Ming-Yu;... J Antimicrob Chemother 3121 1064 English 44290ff75bad8ffaf5d3028420739ce7b08dc2a9 2010-01-20
... ... ... ... ... ... ... ... ... ... ... ... ...
29597 228650bc0429064d800d4b9c5fb0e00c2533a579 10.1371/journal.pone.0215186 Lipidome profiles of postnatal day 2 vaginal s... We hypothesized that postnatal development of ... Early nutritional environment affects long ter... Harlow, KaLynn; Ferreira, Christina R.; Sobrei... PLoS One 4139 1489 English 228650bc0429064d800d4b9c5fb0e00c2533a579 2019-09-26
29598 2246e28681bde69c65dc9081df367bb661997f19 10.1371/journal.pntd.0000690 Secondary Syphilis in Cali, Colombia: New Conc... Venereal syphilis is a multi-stage, sexually t... Syphilis is a sexually transmitted disease (ST... Cruz, Adriana R.; Pillay, Allan; Zuluaga, Ana ... PLoS Negl Trop Dis 5621 1976 English 2246e28681bde69c65dc9081df367bb661997f19 2010-05-18
29599 577c6a13f9ef70e9756890fc66e98f537c01ac0a 10.1038/srep21878 Replication and shedding of MERS-CoV in Jamaic... The emergence of Middle East respiratory syndr... Scientific RepoRts | 6:21878 | DOI: 10 .1038/s... Munster, Vincent J.; Adney, Danielle R.; van D... Sci Rep 2832 1012 English 577c6a13f9ef70e9756890fc66e98f537c01ac0a 2016-02-22
29600 c5c2bc7a07670d6fb970d84a59aab3832752a3f1 10.3390/v10040199 Role of the ERK1/2 Signaling Pathway in the Re... We have previously shown that the infection of... Arenaviruses are enveloped RNA viruses contain... Brunetti, Jesús E.; Foscaldi, Sabrina; Quintan... Viruses 4805 1522 English c5c2bc7a07670d6fb970d84a59aab3832752a3f1 2018-04-17
29601 ba29366173f97f54a22e5c410b3d05e9a9649d28 10.3390/insects10110394 Foodborne Transmission of Deformed Wing Virus ... Virus host shifts occur frequently, but the wh... Emerging infectious diseases (EIDs) can cause ... Schläppi, Daniel; Lattrell, Patrick; Yañez, Or... Insects 3236 1237 English ba29366173f97f54a22e5c410b3d05e9a9649d28 2019-11-07

29602 rows × 12 columns

In [26]:
covid_df = test
covid_df.head() 
Out[26]:
paper_id doi title abstract body_text authors journal body_word_count body_unique_count language sha publish_time
0 4ed70c27f14b7f9e6219fe605eae2b21a229f23c 10.1080/14787210.2017.1271712 Update on therapeutic options for Middle East ... Introduction: The Middle East Respiratory Synd... The Middle East respiratory syndrome coronavir... Al-Tawfiq, Jaffar A.; Memish, Ziad A. Expert Rev Anti Infect Ther 2748 996 English 4ed70c27f14b7f9e6219fe605eae2b21a229f23c 2016-12-24
1 306ef95a3a91e13a93bcc37fb2c509b67c0b5640 10.1093/cid/ciaa256 A Novel Approach for a Novel Pathogen: using a... Thousands of people in the United States have ... The 2019 novel coronavirus (SARS-CoV-2), ident... Bryson-Cahn, Chloe; Duchin, Jeffrey; Makarewic... Clin Infect Dis 944 486 English 306ef95a3a91e13a93bcc37fb2c509b67c0b5640 2020-03-12
2 ab680d5dbc4f51252da3473109a7885dd6b5eb6f 10.1016/b978-0-12-800049-6.00293-6 Evolutionary Medicine IV. Evolution and Emerge... Abstract This article discusses how evolutiona... The evolutionary history of humans is characte... Scarpino, S.V. Encyclopedia of Evolutionary Biology 2884 1091 English ab680d5dbc4f51252da3473109a7885dd6b5eb6f 2016-12-31
3 6599ebbef3d868afac9daa4f80fa075675cf03bc 10.1016/j.enpol.2008.08.029 International aviation emissions to 2025: Can ... Abstract International aviation is growing rap... Sixty years ago, civil aviation was an infant ... Macintosh, Andrew; Wallace, Lailey Energy Policy 5838 1587 English 6599ebbef3d868afac9daa4f80fa075675cf03bc 2009-01-31
4 44290ff75bad8ffaf5d3028420739ce7b08dc2a9 10.1093/jac/dkp502 Inhibition of enterovirus 71 replication and t... OBJECTIVES: Enterovirus 71 (EV71) causes serio... Enteroviruses are members of the family Picorn... Hung, Hui-Chen; Chen, Tzu-Chun; Fang, Ming-Yu;... J Antimicrob Chemother 3121 1064 English 44290ff75bad8ffaf5d3028420739ce7b08dc2a9 2010-01-20

Now that we have the data in the right format, let's extract the publication year and visualize it!

In [27]:
years_df = covid_df.copy()
years_df['publish_year'] = covid_df['publish_time'].apply(lambda x : x.year)
dates = years_df['publish_year'].value_counts()
dates
Out[27]:
2019    2652
2018    2422
2017    2282
2016    2244
2015    2067
2014    1932
2013    1777
2012    1584
2011    1435
2020    1432
2010    1338
2009    1267
2008    1112
2007    1005
2006     920
2005     844
2004     730
2003     318
1991     138
1992     133
1995     125
2002     122
1993     115
2000     114
1988     110
2001     108
1990     107
1998     106
1994     103
1989     101
1996     100
1999      98
1997      96
1987      95
1986      83
1985      53
1984      51
1981      45
1983      44
1982      35
1979      30
1980      30
1978      24
1977      21
1976      15
1975      12
1973       7
1970       7
1974       7
1972       4
1971       2
Name: publish_year, dtype: int64
In [28]:
fig, ax = plt.subplots(figsize=(20, 20))
plt.xticks(rotation=45)
sns.barplot(x=dates.index, y=dates.values, ax=ax, palette=sns.color_palette('Blues', len(dates)))
plt.show()

A super interesting thing to note here is that coronavirus research was relatively low level up until a huge spike in 2004; this directly coincides with the SARS outbreak from 2002-2003, hence why we can observe a huge spike in research published the following year (!!) This was only further fueled after the outbreak of MERS in 2012, a different species of coronavirus that began to spread in the Middle East.

Another thing to note is that 2020 is only around halfway through at the time of this writing (May, 2020) - yet, enough research has already been published (and of course, countless other scientsts are likely working on new research as we speak) that it's on pace to likely crush numbers from any prior years.

Let's now focus on the past two years of research, and compare month by month to see how research activity has developed. Again, if our assumptions are correct, we should see a large spike in activity in the first few months of 2020:

In [29]:
two_years_df = years_df.copy()
two_years_df['month'] = years_df['publish_time'].apply(lambda x : x.month)
nineteen_df = two_years_df[two_years_df['publish_year'] == 2019]
twenty_df = two_years_df[two_years_df['publish_year'] == 2020]

nineteen_counts = nineteen_df['month'].value_counts()
nineteen_counts.sort_index(inplace=True)
nineteen_counts.rename('2019', inplace=True)
nineteen_counts.fillna(value=0, inplace=True)


twenty_counts = twenty_df['month'].value_counts()
twenty_counts.sort_index(inplace=True)
twenty_counts.rename('2020', inplace=True)
twenty_counts.fillna(value=0, inplace=True)
twenty_counts.iloc[5:12] = 0

combined_counts = pd.concat([nineteen_counts, twenty_counts], axis=1)
combined_counts.reset_index(inplace=True)
number_to_month = {
    1: 'Jan', 
    2: 'Feb',
    3: 'Mar',
    4: 'Apr',
    5: 'May',
    6: 'Jun',
    7: 'Jul',
    8: 'Aug',
    9: 'Sept',
    10: 'Oct',
    11: 'Nov',
    12: 'Dec'
}
combined_counts['month'] = combined_counts['index'].apply(lambda x : number_to_month[x])
combined_counts.drop(['index'], axis=1, inplace=True)
month_data = combined_counts.melt('month', var_name='Year', value_name='Number of Papers')

fig, ax = plt.subplots(figsize=(20, 20))
sns.barplot(x='month', y='Number of Papers', hue='Year', ax=ax, data=month_data, palette='coolwarm')
plt.show()

Interesting Note

The 2020 months are actually supposed to have even more articles attributed to them, but the metadata_df unfortunately had the wrong publication date for quite a few of the 2020 articles, (i.e. it defaulted to just setting them as being published in December when they were just in a journal published in, say, June 2020, which is in the "future"); because of this, I had to remove around ~ 150 articles from the 2020 dataset.

Even then we can see that 2020 has obviously had a massive spike in research. Again, this has some cool correspondences to the actual timeline of the disease: late-December/early-January is when WHO first made a risk assessment of the disease and China publicly shared COVID-19's genetic sequence; late-January WHO declared a global health emergency, and February was Wuhan went under lockdown, the disease began spreading rapidly to Europe and the U.S., Wuhan went under lockdown, and the US confirmed it's first case, alongside other nations such as South Korea and France who had also reported new cases. And this directly reflects in our data, as it's evident that March had a MASSIVE spike in research, with pretty much double the amount of papers published.

NLP + Feature Extraction

Now that we've successfully cleaned out the data and gotten all the text in a consistent fashion, now we're going to create a bag-of-words model and vectorize each of the documents! This'll then allow us to do some better visualization and run some cool ~ machine learning ~ like PCA and t-SNE to both reduce dimensionality and visualize this better! Our main tool to do this will be NLTK, so if there are any questions concerning any of the methods used, just check out their documentation here:

In [30]:
dropped = covid_df[covid_df['language'] == 'English'] # i.e. only select articles written in english, as it'll help parsing/NLP 
In [31]:
covid_df['language'].describe()
Out[31]:
count       29602
unique         11
top       English
freq        29049
Name: language, dtype: object

First, let's import NLTK and download the necessary pre-trained model as well as stopwords to filter out our text!

In [32]:
# NLP analysis using NLTK
import nltk
nltk.download('punkt')
nltk.download('stopwords')
from nltk.corpus import stopwords 
import nltk.tokenize as t
import re

stop_words = list(set(stopwords.words('english')))
print(stop_words)
['shouldn', 'there', 'their', 'don', 'you', "you've", 'his', 'can', 'hadn', 'y', 'how', 'very', 'on', 'after', "you'd", 'he', 'to', 'out', 'my', 'few', 'doing', "aren't", 'your', 'further', 'the', 'against', 'myself', 'which', 'a', 'yourself', 'and', "weren't", 'who', 'whom', 'over', 'about', 'them', 'wouldn', 'himself', 'under', 'mustn', 'him', 'no', "hadn't", "wasn't", 'our', 'wasn', 'here', 'before', "it's", 'an', 'aren', 'some', 'it', 'hasn', 'then', 'being', 'been', 's', 'hers', 'haven', "shan't", 'but', 'yours', 'we', "you're", "you'll", 'not', 'needn', 'in', 'has', 're', 'isn', 'had', 'when', 'now', 'where', 'should', 'just', "mightn't", 'or', "shouldn't", 'ours', 'above', 'than', 'down', 'i', "don't", 'off', 'me', 'theirs', 'into', "couldn't", 'while', "she's", 'o', 'own', 'each', "haven't", 'what', 'themselves', 'does', 'through', "mustn't", 'this', 'up', 'more', 'if', 'll', "hasn't", 'd', "needn't", 'couldn', 'only', 'her', 'do', 'weren', 'other', 'again', 'all', 't', "wouldn't", 'at', 'that', 'ourselves', 'did', 'having', 'of', 'its', 'with', "should've", 'by', 'm', 'am', 'nor', 'ain', "isn't", 'is', 'are', 'why', 'itself', 'same', 'so', 'shan', 'were', 'those', "didn't", "won't", 'such', 'from', 'as', 'too', 'didn', 'won', 'any', 'most', 've', 'for', 'until', 'doesn', 'ma', 'was', "that'll", 'be', 'once', "doesn't", 'herself', 'these', 'will', 'during', 'they', 'below', 'because', 'both', 'yourselves', 'have', 'between', 'mightn', 'she']
[nltk_data] Downloading package punkt to
[nltk_data]     /Users/justinchoi/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/justinchoi/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!

Since biomedical journals also have a ton of reoccuring words that are common-place, we'll add these to our list of stopwords so that we can further remove useless data and ultimately give us a cleaner model at the end:

In [33]:
# add in additional stopwords frequently used in biomedical/research articles
bio_stop_words = ['doi', 'preprint', 'copyright', 'www', 'PMC', 'pmc', 'al.', 'fig', 'fig.', 'permission', 'used', 'peer', 'reviewed', 'org', 'https', 'et', 'al', 'author', 'figure', 'rights', 'reserved', 'biorxiv', 'medrxiv', 'license', 'CZI', 'czi']

for word in bio_stop_words:
    if word not in stop_words:
        stop_words.append(word)
        
print(stop_words)
['shouldn', 'there', 'their', 'don', 'you', "you've", 'his', 'can', 'hadn', 'y', 'how', 'very', 'on', 'after', "you'd", 'he', 'to', 'out', 'my', 'few', 'doing', "aren't", 'your', 'further', 'the', 'against', 'myself', 'which', 'a', 'yourself', 'and', "weren't", 'who', 'whom', 'over', 'about', 'them', 'wouldn', 'himself', 'under', 'mustn', 'him', 'no', "hadn't", "wasn't", 'our', 'wasn', 'here', 'before', "it's", 'an', 'aren', 'some', 'it', 'hasn', 'then', 'being', 'been', 's', 'hers', 'haven', "shan't", 'but', 'yours', 'we', "you're", "you'll", 'not', 'needn', 'in', 'has', 're', 'isn', 'had', 'when', 'now', 'where', 'should', 'just', "mightn't", 'or', "shouldn't", 'ours', 'above', 'than', 'down', 'i', "don't", 'off', 'me', 'theirs', 'into', "couldn't", 'while', "she's", 'o', 'own', 'each', "haven't", 'what', 'themselves', 'does', 'through', "mustn't", 'this', 'up', 'more', 'if', 'll', "hasn't", 'd', "needn't", 'couldn', 'only', 'her', 'do', 'weren', 'other', 'again', 'all', 't', "wouldn't", 'at', 'that', 'ourselves', 'did', 'having', 'of', 'its', 'with', "should've", 'by', 'm', 'am', 'nor', 'ain', "isn't", 'is', 'are', 'why', 'itself', 'same', 'so', 'shan', 'were', 'those', "didn't", "won't", 'such', 'from', 'as', 'too', 'didn', 'won', 'any', 'most', 've', 'for', 'until', 'doesn', 'ma', 'was', "that'll", 'be', 'once', "doesn't", 'herself', 'these', 'will', 'during', 'they', 'below', 'because', 'both', 'yourselves', 'have', 'between', 'mightn', 'she', 'doi', 'preprint', 'copyright', 'www', 'PMC', 'pmc', 'al.', 'fig', 'fig.', 'permission', 'used', 'peer', 'reviewed', 'org', 'https', 'et', 'al', 'author', 'figure', 'rights', 'reserved', 'biorxiv', 'medrxiv', 'license', 'CZI', 'czi']
In [34]:
# helper function to remove punctuation from sentences 
def remove_punctuation(sentence):
    sentence = re.sub(r'[^\w\s]','', sentence)
    return sentence

def remove_stopwords(sentence):
    return [word for word in sentence if word not in stop_words]

def parse_text(text): 
    sentences = t.sent_tokenize(text)
    sentences = [sentence.lower() for sentence in sentences] # lower case the text so that we can correctly remove text
    cleaned_sentences = [remove_punctuation(sentence) for sentence in sentences]
    tokenized_sentences = [t.word_tokenize(clean_sentence) for clean_sentence in cleaned_sentences]
    filtered_sentences = [remove_stopwords(t_sentence) for t_sentence in tokenized_sentences]
    tokens = ''
    for sentence in filtered_sentences:
        sentence_str = ' '.join(sentence)
        tokens = tokens + sentence_str + ' '
    return tokens
In [35]:
# checking to see if the function parse_text works correctly

test_article = covid_df.iloc[2020]['body_text']
test_article
output = parse_text(test_article)
print(output)
emerging pathogen defined causative agent infectious disease whose incidence increasing following appearance new host population whose incidence increasing existing host population result longterm changes underlying epidemiology 1 one potential source emerging pathogen different host species reservoir pathogen already established table 1 switches one host species another species jumps led devastating disease epidemics recorded including ongoing hivaids pandemic human communities worldwide decimation european rabbit population myxomatosis mid 20th century catastrophic impact rinderpest african ruminants late 19th century recently widespread mortality north sea seals result distemper 2 3 4 5 even argued many main killer diseases humans eg measles tb influenza smallpox emerged pathogens jumping domestic animals humans past 10 000 years 6 species jumps also given rise devastating epidemics plant pathogens crop species eg potato late blight cultivated potato wild plant species eg near extinction american chestnut trees chestnut blight 7 8 9 conversely numerous examples species jumps far less dramatic consequences example bsevcjd ebola virus humans although undoubtedly serious problems show signs taking way hivaids moreover many pathogens long history routinely jumping species eg rabies virus humans domestic wild carnivores without triggering major epidemics new host population understanding epidemiology evolutionary biology underlying differences crucial understanding phenomenon emerging infectious diseases human domestic animal wildlife plant populations epidemiological theory well developed conceptual framework evaluating spread infection host population box 1 expected size outbreak depends upon number introductions socalled primary cases infection potential transmission pathogen one new host another box 1 ia transmission potential expressed terms basic reproduction number r 0 pathogens enter new host population via species jump placed two categories depending value r 0 1 new host population even new host repeatedly acquires pathogen limited spread infection within population category emerging pathogen unlikely constitute greatest disease threat examples humans include ebola monkeypox avian influenza viruses vcjd agent conversely r 0 o1 new host population finite chance box 1 ib major epidemic occur category likely constitute greatest disease threat examples humans include hiv influenza type virus sars coronavirus key difference two categories lies origin infections within new host population r 0 1 large proportion infections acquired directly original source host population r 0 o1 outbreak takes infections acquired within new host population resulting positive feedback potentially fuelling major epidemic transition behaviours region r 0 z1 size epidemic highly sensitive small changes transmission potential box 1 ia especially relevant pathogen emergence implies relatively small changes r 0 large impacts incidence infection widely discussed many reasons r 0 might change include changes host ecology environment example urbanization cited key factor increasing transmission potential many human viral bacterial infections 10 11 extremely high densities european wheat barley crops diseases yellow rust powdery mildew respectively 12 13 another example ongoing concern climate change might associated changing distributions vectorborne diseases tickborne encephalitis lyme disease 14 15 16 changes host behaviour movements example patterns sexual behaviour directly affect potential spread sexually transmitted diseases 10 17 global travel exacerbated spread sars 18 changes host phenotype example immunosuppression hospital treatments due effects hivaids cited contributing spread numerous infections eg fungal pathogen pneumocystis carinii 10 11 17 loss crossimmunity acquired immune responses induced exposure one microorganism least partially protective infection another quite general concept example underlies efficacy bcg vaccination might also increase potential invasions new pathogens suggested several pairings yaws syphilis leprosy tb yellow fever dengue fever smallpox monkeypox vivax malaria falciparum malaria 17 19 20 changes host genetics example loss major histocompatibility complex haplotypes genetic diversity inbred livestock small populations might increase susceptibility infection 21 among plants use single cultivars increases vulnerability many crop species widespread epidemics pathogens spilling closely related wild host species example commercial banana plantations composed single clone risk various races panama disease fusarium oxysporum 22 changes pathogen genetics discussed detail discussion considers fate primary cases infection introduced new host population understand species jumps need consider greater detail biological processes occurring around jump first step exposure new host species pathogen 1 rate exposure function ecologies behaviour two host species transmission biology pathogen including biology vectors involved indeed ecological change broadest sense associated instances disease emergence 11 23 24 example phocine distemper lyme disease bse hepatitis c vectorborne diseases exposure new host species might facilitated pathogen jumping vector species populations suggested venezuelan equine encephalitis virus veev 25 review trends ecology evolution vol20 5 may 2005 second step pathogen able infect new host pathogen host compatible pathogens highly variable host ranges naturally infect single species eg mumps virus plasmodium falciparum humans whereas others infect hosts different taxonomic orders even classes eg rabies virus protozoan blastocystis hominis 26 reasons variation poorly understood although certain factors indirect route transmission known associated broad host range 26 viruses one factor use cell receptors phylogenetically conserved 23 crucial ability cellfree virus infect hosts presence appropriate cell receptors host cells receptors conserved across range potential host species hosts likely predisposed box 1 epidemic thresholds r 0 basic reproduction number basic reproduction number r 0 average number secondary cases infection generated single primary case introduced large population previously unexposed hosts 5 r 0 related transmissibility pathogen new infections per unit time duration infectiousness sometimes referred measure transmission potential analogies r max simple ecological theory refers situation densitydependent constraints spread infection r 0 defines important threshold r 0 o1 primary infection average generate one secondary infection pathogen capable invading host population r 0 1 primary case average fail replace although short chains transmission still possible single introduction lead minor outbreak expected final size outbreak infectious disease final related r 0 0 number primary cases infection using modified version kermackmckendrick equation equation 26 n size susceptible population behaviour equation illustrated ia r 0 1 size outbreak determined mainly number primary cases 0 r 0 o1 size outbreak determined mainly size susceptible population n r 0 close 1 size outbreak sensitive precise value r 0 even r 0 o1 major epidemic inevitable possible infection die without causing major epidemic owing demographic stochasticity probability major epidemic occurring related r 0 0 using equation ii 17 behaviour equation illustrated ib r 0 values much 1 high probability major epidemic occur unless many primary cases even larger r 0 values good chance epidemic occur primary cases probability following single crossspecies transmission event pathogen r 0 1in new host pathogen adapts new host outbreak approximated equation iii 19 small probability required genetic change occurs single infection behaviour equation illustrated ic new r 0 o1 evolved pathogen give rise major epidemic probability given equation ii 17 c approximate relationship valid m1 r 0 1 close 1 probability pathogen adapts outbreak r 0 becomes o1 p adaptation original value r 0 equation iii mz00001 0001 001 redrawn 19 review trends ecology evolution vol20 5 may 2005 infection viruses using receptors example use conserved receptors might explain wide host ranges footandmouth diseases virus fmdv uses integrin vitronectin rabies virus uses nicotinic acetylcholine receptor 27 however even capable infecting different host species pathogens usually although always significantly less infectious referred species barrier substantial implying much higher doses required infect new host example dose rabies virus foxes required infect dogs cats shown experimentally million times greater required infect foxes 28 third final step successful species jump pathogen sufficiently transmissible individuals within new host population discussed relates value r 0 therefore whether pathogen successfully invade new host population r 0 o1 sense epidemic waiting happen numerous recent examples include introductions west nile virus north american birds phocine distemper virus north sea seals dutch elm disease elms uk usa indeed emerging pathogens often special concern absence shared evolutionary history new host implies absence evolved constraints susceptibility pathogenicity might least instances enable disease outbreaks large magnitude unusual severity 21 29 conversely r 0 1 arguments presented earlier imply primary case result chain transmission new host population stutter extinction however might overly optimistic possibility pathogen evolves r 0 becomes o1 result could go generate major epidemic 19 evolution adaptation pathogen involve genetic changes ranging nucleotide substitutions eg canine parvovirus cpv 30 gene capture organisms eg salmonella enterica escherichia coli 31 recombination reassortment eg h5n1 influenza 32 ophiostoma novoulmi agent dutch elm disease 33 hybridization eg phytophthora alni alder trees northwest europe appears allopolyploid recombinant newly introduced pathogen hard woods p cambivora related specialist pathogen raspberries strawberries 34 adaptation might rapid pathogen lineages adapt different host tissues vector cells versus host cells 25 35 probability successful adaptation occurring depends several factors number primary infections 0 ii initial r 0 infection new host population iii number mutations genetic changes required iv likelihood changes occurring r 0 changes step relatively simple see probability emergence increases linearly 0 much sensitive evolution r 0 particularly close 1 19 probability rare evolutionary step proportional expected size initial outbreak hence number opportunities required genetic changes occur box 1 ic expected size outbreak related nonlinearly r 0 box 1 ia conditions likely differ outbreak outbreak outbreak sizes practice tend highly overdispersed occasional larger outbreaks providing opportunities adaptation 2 beginning understand biology underlying host adaptation instances example virus receptor use labile 27 point escherichia coli o157 scotland 1996 2003 blue line 63 outbreaks 1008 cases data health protection scotland httpwwwhpsscotnhsuk examples outbreaks small 10 cases outbreaks large hundreds cases indicated strongly convex shape plots consistent general trend disease outbreak size distributions follow power law exponent o2 indicative severe overdispersion 26 review trends ecology evolution vol20 5 may 2005 mutations viral capsid enable use new receptors example fmdv switch using additional receptors accumulating amino acid substitutions cell culture 36 feline panleukopenia virus fplv evolved cpv acquiring ability use canine transferrin receptor result changes capsid amino acid sequence 30 adaptation veev equines associated changes envelope glycoprotein 25 even though sometimes possible point genetic differences pathogens original new host often difficult ascribe changes events original host population jump ie predisposition novel genotypes jump species ii events adaptation phase shift r 0 1 r 0 o1 new host iii subsequent divergence pathogen established new host host jumping likely important driver pathogen diversity evolutionary time evidence historically deeper host jumps provided incongruencies phylogenetic topologies host species respective pathogens 37 example recent study rna viruses hantaviruses spumaviruses avian sarcoma leucosis viruses showed significant levels congruence host species whereas arenaviruses lyssaviruses showed congruence 38 several additional issues could also analyzed using framework assessing likelihood successful invasion many apply biological invasions general 39 example fluctuating transmission rates increase persistence times pathogens r 0 1 40 also possible rather single introductions new host population periods pathogen spreading mixture host species eg proposed human sleeping sickness 41 could increase persistence times new host simultaneously reduce selection pressure pathogen adapt new host new host population unlikely homogeneous individuals might susceptible novel pathogens eg owing immunosuppression andor exposed eg owing behaviour spatial location 42 andor likely transmit infection socalled supershedders heterogeneities increase outbreak sizes 5 structure new host population might also important contrast ebola usually affects remote communities sars arose region high human population density large numbers movements extensive travel similarly hivaids epidemic apparently took escaped remote communities entered urban populations 17 general issue relationship samplers individuals high risk acquiring novel infections spreaders individuals high potential transmitting novel infection onwards within new host population closer epidemiological linkage groups greater potential successful invasions new pathogens mechanism genetic change pathogen also likely important evolve mutation also recombination eg influenza viruses sars coronavirus 32 43 gene capture eg pathogenicity islands antimicrobial resistance genes bacteria 44 45 hybridization eg p alni might influence epidemiology species jumps requiring host coinfected two different pathogens perhaps different sources emergence new pathogen following species jump represents successful colonization new habitat reflecting emerging pathogens compared weeds 46 although extremely hard predict pathogens likely jump host species hints progress made example perhaps striking feature list examples species jumps given table 1 pathogens involved mostly disproportionately singlestranded rna viruses although table 1 regarded exhaustive survey might well genuine effect reflecting typically broader host ranges much higher mutation rates rna viruses 26 facilitating initial infection new host subsequence adaptation host refining observation none rna viruses listed transmitted arthropod vectors perhaps reflecting constraints imposed small genome compatible vector well definitive host support emerging longrecognised zoonotic rna arboviruses tend poorly transmissible humans 24 25 although experimental evidence constraints inconclusive 25 second notable feature table 1 obvious indication close taxonomic relatedness original new host species consistent systematic survey emerging zoonotic pathogens humans 47 found probable reservoirs rank order ungulates ii carnivores iii rodents iv primates v birds nonmammalian hosts vi bats vii marine mammals even suggested nanoviruses jumped plants vertebrates 48 broad host range seems important potential pathogen jump species relatedness hosts involved 23 24 47 unpredictability pathogen emergence means first line defence effective surveillance requiring identification monitoring highrisk populations individuals locations even setting sentinel systems 11 recent work using agentbased simulation models points way efficient design surveillance systems based understanding contact network within host populations 49 effective public health andor veterinary response requires prompt coordinated action multidisciplinary teams exemplified recent global effort led world health organization httpwwwwhoint combat sars 50 importance rapid identification assessment action review overstated often single biggest factor affecting scale epidemic speed effective interventions put place 51 52 
In [36]:
tqdm.pandas()
test = covid_df.copy()
test['parsed_text'] = test['body_text'].progress_apply(parse_text)
test
100%|██████████| 29602/29602 [20:04<00:00, 24.57it/s]
Out[36]:
paper_id doi title abstract body_text authors journal body_word_count body_unique_count language sha publish_time parsed_text
0 4ed70c27f14b7f9e6219fe605eae2b21a229f23c 10.1080/14787210.2017.1271712 Update on therapeutic options for Middle East ... Introduction: The Middle East Respiratory Synd... The Middle East respiratory syndrome coronavir... Al-Tawfiq, Jaffar A.; Memish, Ziad A. Expert Rev Anti Infect Ther 2748 996 English 4ed70c27f14b7f9e6219fe605eae2b21a229f23c 2016-12-24 middle east respiratory syndrome coronavirus m...
1 306ef95a3a91e13a93bcc37fb2c509b67c0b5640 10.1093/cid/ciaa256 A Novel Approach for a Novel Pathogen: using a... Thousands of people in the United States have ... The 2019 novel coronavirus (SARS-CoV-2), ident... Bryson-Cahn, Chloe; Duchin, Jeffrey; Makarewic... Clin Infect Dis 944 486 English 306ef95a3a91e13a93bcc37fb2c509b67c0b5640 2020-03-12 2019 novel coronavirus sarscov2 identified cau...
2 ab680d5dbc4f51252da3473109a7885dd6b5eb6f 10.1016/b978-0-12-800049-6.00293-6 Evolutionary Medicine IV. Evolution and Emerge... Abstract This article discusses how evolutiona... The evolutionary history of humans is characte... Scarpino, S.V. Encyclopedia of Evolutionary Biology 2884 1091 English ab680d5dbc4f51252da3473109a7885dd6b5eb6f 2016-12-31 evolutionary history humans characterized dyna...
3 6599ebbef3d868afac9daa4f80fa075675cf03bc 10.1016/j.enpol.2008.08.029 International aviation emissions to 2025: Can ... Abstract International aviation is growing rap... Sixty years ago, civil aviation was an infant ... Macintosh, Andrew; Wallace, Lailey Energy Policy 5838 1587 English 6599ebbef3d868afac9daa4f80fa075675cf03bc 2009-01-31 sixty years ago civil aviation infant industry...
4 44290ff75bad8ffaf5d3028420739ce7b08dc2a9 10.1093/jac/dkp502 Inhibition of enterovirus 71 replication and t... OBJECTIVES: Enterovirus 71 (EV71) causes serio... Enteroviruses are members of the family Picorn... Hung, Hui-Chen; Chen, Tzu-Chun; Fang, Ming-Yu;... J Antimicrob Chemother 3121 1064 English 44290ff75bad8ffaf5d3028420739ce7b08dc2a9 2010-01-20 enteroviruses members family picornaviridae 70...
... ... ... ... ... ... ... ... ... ... ... ... ... ...
29597 228650bc0429064d800d4b9c5fb0e00c2533a579 10.1371/journal.pone.0215186 Lipidome profiles of postnatal day 2 vaginal s... We hypothesized that postnatal development of ... Early nutritional environment affects long ter... Harlow, KaLynn; Ferreira, Christina R.; Sobrei... PLoS One 4139 1489 English 228650bc0429064d800d4b9c5fb0e00c2533a579 2019-09-26 early nutritional environment affects long ter...
29598 2246e28681bde69c65dc9081df367bb661997f19 10.1371/journal.pntd.0000690 Secondary Syphilis in Cali, Colombia: New Conc... Venereal syphilis is a multi-stage, sexually t... Syphilis is a sexually transmitted disease (ST... Cruz, Adriana R.; Pillay, Allan; Zuluaga, Ana ... PLoS Negl Trop Dis 5621 1976 English 2246e28681bde69c65dc9081df367bb661997f19 2010-05-18 syphilis sexually transmitted disease std caus...
29599 577c6a13f9ef70e9756890fc66e98f537c01ac0a 10.1038/srep21878 Replication and shedding of MERS-CoV in Jamaic... The emergence of Middle East respiratory syndr... Scientific RepoRts | 6:21878 | DOI: 10 .1038/s... Munster, Vincent J.; Adney, Danielle R.; van D... Sci Rep 2832 1012 English 577c6a13f9ef70e9756890fc66e98f537c01ac0a 2016-02-22 scientific reports 621878 10 1038srep21878 ara...
29600 c5c2bc7a07670d6fb970d84a59aab3832752a3f1 10.3390/v10040199 Role of the ERK1/2 Signaling Pathway in the Re... We have previously shown that the infection of... Arenaviruses are enveloped RNA viruses contain... Brunetti, Jesús E.; Foscaldi, Sabrina; Quintan... Viruses 4805 1522 English c5c2bc7a07670d6fb970d84a59aab3832752a3f1 2018-04-17 arenaviruses enveloped rna viruses containing ...
29601 ba29366173f97f54a22e5c410b3d05e9a9649d28 10.3390/insects10110394 Foodborne Transmission of Deformed Wing Virus ... Virus host shifts occur frequently, but the wh... Emerging infectious diseases (EIDs) can cause ... Schläppi, Daniel; Lattrell, Patrick; Yañez, Or... Insects 3236 1237 English ba29366173f97f54a22e5c410b3d05e9a9649d28 2019-11-07 emerging infectious diseases eids cause signif...

29602 rows × 13 columns

In [37]:
# assign the copy we made with the parsed text to our current working dataframe
covid_df = test

Word Count Distribution

Since we're already working with text data here, an interesting thing to note here is how our journals are distributed in terms of length! We get a good idea of how long our papers are here, for the most part, our journals are under 5500 words in length, and it's only a very small part of our dataset that has extremely long articles (e.g. the max one with 172000 words in total, sheesh). Note: this word count data was before we cleaned it using NLTK, so this includes all stopwords as well.

In [38]:
sns.set(style='white', palette='dark', color_codes=True)
In [39]:
fig, ax = plt.subplots(figsize=(20, 20))
plt.ylabel('Percentage of Articles', fontsize=15)
sns.distplot(covid_df['body_word_count'], ax=ax, color='g')
plt.title('Total Word Count', fontsize=25)
plt.xlabel('Words', fontsize=15)
plt.show()
covid_df['body_word_count'].describe()
Out[39]:
count     29602.000000
mean       4559.986420
std        3528.632565
min           2.000000
25%        2704.250000
50%        3846.000000
75%        5533.000000
max      171948.000000
Name: body_word_count, dtype: float64
In [40]:
fig, ax = plt.subplots(figsize=(20, 20))
plt.ylabel('Percentage of Articles', fontsize=15)
sns.distplot(covid_df['body_unique_count'], ax=ax, color='m')
plt.title('Unique Word Count', fontsize=25)
plt.xlabel('Words', fontsize=15)
plt.show()
covid_df['body_unique_count'].describe()
Out[40]:
count    29602.000000
mean      1425.128505
std        748.541092
min          2.000000
25%        989.000000
50%       1288.000000
75%       1695.000000
max      25156.000000
Name: body_unique_count, dtype: float64

Vectorizing Our Documents!

Now comes the NLP part of our project - we'll be utilizing the idea of tf-idf (term frequency - inverse document frequency) in order to make each of our documents into a workable, normalized vector that we can manipulate and compare with other document vectors! We'll be using scikit's inbuilt feature extraction package that has a vectorizer for tf-idf for this task ~

Note: I limit the max number of features allowed in the vectorizer to 212 because we want to make sure this step doesn't take absolutely forever, but if we wanted to increase the model complexity to see if we could extract more useful information out of it, we can always change this in the future!

Again, if there are any questions on this part, the documentation for the tf-idf vectorizer can be found here

In [41]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(max_features=2**12)

# get all of the text that was processed by our NLTK parser
all_text = covid_df['parsed_text'].values
tfidf_matrix = tfidf.fit_transform(all_text)
tfidf_matrix.shape
Out[41]:
(29602, 4096)

mAchiNe LEaRnINg !!

Now comes the ~fancy~ machine learning algoritms; for this dataset, we're going to mainly use two unsupervised learning methods (since there's nothing to really classify/regress per se for this dataset of articles) - Latent Semantic Analysis (LSA) and K-Means Clustering!

LSA is actually very similar to PCA in the sense that it's great for dimensionality reduction, but it focuses specifically on the tf-idf matrix we've made, so we'll be using the TruncatedSVD package in scikit in order to perform it (since our term-document matrix is far more sparse than the covariance matrix a typical PCA would run on). This'll help us map these matrices to "semantic spaces" that have lower dimensionality.

In [42]:
from sklearn.decomposition import TruncatedSVD
t_svd = TruncatedSVD(n_components=100, random_state=2022)
tfidf_reduced = t_svd.fit_transform(tfidf_matrix)
tfidf_reduced.shape
Out[42]:
(29602, 100)

K-Means Clustering

Now is where we'll be running our clustering algorithm on the reduced dimensionality data! This'll give us a natural partitioning of our data into a number of clusters, and we'll find the right range of k-values using the "elbow method" (i.e. we'll find where the reduction in distortion begins to taper off given a value of k)

In [43]:
from sklearn.cluster import KMeans
import scipy

# create an array of distortion values so we can visualize and find the elbow
distortion = []
K = range(3, 60)
for k in K:
    km = KMeans(n_clusters=k, random_state=2022)
    km.fit(tfidf_reduced)
    distortion.append(sum(np.min(scipy.spatial.distance.cdist(tfidf_reduced, km.cluster_centers_, 'euclidean'), axis=1))\
     / tfidf_reduced.shape[0])
    if k % 5 == 0:
        print(f'distortion for {k} clusters')
distortion for 5 clusters
distortion for 10 clusters
distortion for 15 clusters
distortion for 20 clusters
distortion for 25 clusters
distortion for 30 clusters
distortion for 35 clusters
distortion for 40 clusters
distortion for 45 clusters
distortion for 50 clusters
distortion for 55 clusters
In [44]:
sns.set(palette='dark')
fig, ax = plt.subplots(figsize=(20, 20))
sns.lineplot(x=K, y=distortion)
plt.title('Distortion Plot', fontsize=20)
plt.xlabel('# of Clusters "K"', fontsize=15)
plt.ylabel('Distortion', fontsize=15)
plt.show()

Although it's unclear at first, we can see that the "staggering" effects begin to happen around 20 and we begin to get less and less noticable reductions around 30, so we'll set k = 25 for our clustering. We'll run it again with this value and now add a column to our dataframe denoting the cluster assignments.

In [45]:
# set number of clusters, k 
k = 25

# perform the algorithm with the given value of k 
km = KMeans(n_clusters=k, random_state=2022)
assignments = km.fit_predict(tfidf_reduced)
In [46]:
covid_df['cluster_assigments'] = assignments
covid_df.head()
Out[46]:
paper_id doi title abstract body_text authors journal body_word_count body_unique_count language sha publish_time parsed_text cluster_assigments
0 4ed70c27f14b7f9e6219fe605eae2b21a229f23c 10.1080/14787210.2017.1271712 Update on therapeutic options for Middle East ... Introduction: The Middle East Respiratory Synd... The Middle East respiratory syndrome coronavir... Al-Tawfiq, Jaffar A.; Memish, Ziad A. Expert Rev Anti Infect Ther 2748 996 English 4ed70c27f14b7f9e6219fe605eae2b21a229f23c 2016-12-24 middle east respiratory syndrome coronavirus m... 16
1 306ef95a3a91e13a93bcc37fb2c509b67c0b5640 10.1093/cid/ciaa256 A Novel Approach for a Novel Pathogen: using a... Thousands of people in the United States have ... The 2019 novel coronavirus (SARS-CoV-2), ident... Bryson-Cahn, Chloe; Duchin, Jeffrey; Makarewic... Clin Infect Dis 944 486 English 306ef95a3a91e13a93bcc37fb2c509b67c0b5640 2020-03-12 2019 novel coronavirus sarscov2 identified cau... 0
2 ab680d5dbc4f51252da3473109a7885dd6b5eb6f 10.1016/b978-0-12-800049-6.00293-6 Evolutionary Medicine IV. Evolution and Emerge... Abstract This article discusses how evolutiona... The evolutionary history of humans is characte... Scarpino, S.V. Encyclopedia of Evolutionary Biology 2884 1091 English ab680d5dbc4f51252da3473109a7885dd6b5eb6f 2016-12-31 evolutionary history humans characterized dyna... 14
3 6599ebbef3d868afac9daa4f80fa075675cf03bc 10.1016/j.enpol.2008.08.029 International aviation emissions to 2025: Can ... Abstract International aviation is growing rap... Sixty years ago, civil aviation was an infant ... Macintosh, Andrew; Wallace, Lailey Energy Policy 5838 1587 English 6599ebbef3d868afac9daa4f80fa075675cf03bc 2009-01-31 sixty years ago civil aviation infant industry... 15
4 44290ff75bad8ffaf5d3028420739ce7b08dc2a9 10.1093/jac/dkp502 Inhibition of enterovirus 71 replication and t... OBJECTIVES: Enterovirus 71 (EV71) causes serio... Enteroviruses are members of the family Picorn... Hung, Hui-Chen; Chen, Tzu-Chun; Fang, Ming-Yu;... J Antimicrob Chemother 3121 1064 English 44290ff75bad8ffaf5d3028420739ce7b08dc2a9 2010-01-20 enteroviruses members family picornaviridae 70... 22

t-SNE

Since we want to be able to visualize our data and at least get somewhat of a sense of how our data is organized (beyond just assignment numbers and meaningless data tables), we're going to try and bring down the dimensionality of our data even further so that we can visualize it in 2D, hence plot it in our notebook, and what better way to do this than t-SNE! For the sake of computational efficiency, we're going to utilize the reduced dimension matrix of data from our LSA, as otherwise it's going to take forever, but if you have the time, you should check it out with the entire featured term-document matrix tf-idf!

Side Note: The LSA that we performed earlier using TruncatedSVD actually has the ability to project our text data onto just two dimensions as well, effectively doing what t-SNE is doing here, but t-SNE is slightly more advanced and effective for this task thanks to advanced mathematical machinery, and for the sake of diversity of methodology, we'll just stuck with t-SNE for now! (however, if I have time later, I'll include another graphic comparing which visualization method is better!)

In [47]:
from sklearn.manifold import TSNE

t_sne = TSNE(verbose=1, perplexity=50, random_state=2020)
tfidf_2_dim = t_sne.fit_transform(tfidf_reduced)
[t-SNE] Computing 151 nearest neighbors...
[t-SNE] Indexed 29602 samples in 0.193s...
[t-SNE] Computed neighbors for 29602 samples in 170.058s...
[t-SNE] Computed conditional probabilities for sample 1000 / 29602
[t-SNE] Computed conditional probabilities for sample 2000 / 29602
[t-SNE] Computed conditional probabilities for sample 3000 / 29602
[t-SNE] Computed conditional probabilities for sample 4000 / 29602
[t-SNE] Computed conditional probabilities for sample 5000 / 29602
[t-SNE] Computed conditional probabilities for sample 6000 / 29602
[t-SNE] Computed conditional probabilities for sample 7000 / 29602
[t-SNE] Computed conditional probabilities for sample 8000 / 29602
[t-SNE] Computed conditional probabilities for sample 9000 / 29602
[t-SNE] Computed conditional probabilities for sample 10000 / 29602
[t-SNE] Computed conditional probabilities for sample 11000 / 29602
[t-SNE] Computed conditional probabilities for sample 12000 / 29602
[t-SNE] Computed conditional probabilities for sample 13000 / 29602
[t-SNE] Computed conditional probabilities for sample 14000 / 29602
[t-SNE] Computed conditional probabilities for sample 15000 / 29602
[t-SNE] Computed conditional probabilities for sample 16000 / 29602
[t-SNE] Computed conditional probabilities for sample 17000 / 29602
[t-SNE] Computed conditional probabilities for sample 18000 / 29602
[t-SNE] Computed conditional probabilities for sample 19000 / 29602
[t-SNE] Computed conditional probabilities for sample 20000 / 29602
[t-SNE] Computed conditional probabilities for sample 21000 / 29602
[t-SNE] Computed conditional probabilities for sample 22000 / 29602
[t-SNE] Computed conditional probabilities for sample 23000 / 29602
[t-SNE] Computed conditional probabilities for sample 24000 / 29602
[t-SNE] Computed conditional probabilities for sample 25000 / 29602
[t-SNE] Computed conditional probabilities for sample 26000 / 29602
[t-SNE] Computed conditional probabilities for sample 27000 / 29602
[t-SNE] Computed conditional probabilities for sample 28000 / 29602
[t-SNE] Computed conditional probabilities for sample 29000 / 29602
[t-SNE] Computed conditional probabilities for sample 29602 / 29602
[t-SNE] Mean sigma: 0.128483
[t-SNE] KL divergence after 250 iterations with early exaggeration: 95.268784
[t-SNE] KL divergence after 1000 iterations: 2.121657
In [58]:
tfidf_2_dim
Out[58]:
array([[ 60.75235  ,   7.594087 ],
       [  6.0101366,  62.89814  ],
       [-16.230865 ,  18.186113 ],
       ...,
       [ 55.64865  ,   2.8969972],
       [-20.519497 , -35.409058 ],
       [ -2.6207633,  -2.0518272]], dtype=float32)
In [84]:
fig, ax = plt.subplots(figsize=(20, 20))
sns.scatterplot(x=tfidf_2_dim[:, 0], y=tfidf_2_dim[:, 1])
plt.title('Basic t-SNE', fontsize=20)
plt.show()

Of course, this gives us the clusters, but it'd be far more interesting if we could see the distinction between clusters using the labels that we had from K-Means! So, let's do just that!

In [86]:
fig, ax = plt.subplots(figsize=(20, 20))
sns.scatterplot(x=tfidf_2_dim[:, 0], y=tfidf_2_dim[:, 1], hue=assignments, legend='full', palette=sns.hls_palette(k, l=0.43, s=0.6))
plt.show()

Significance

Although there are a few outliers in terms of coloring, the super cool thing that we can observe in this graph is that, although they were done separately, the K-Means clustering algorithm and t-SNE algortihm both agreed on how to cluster the data, indicating to us that there indeed is some consistency in our data, and that there must be some shared uniform characteristics among these clusters. By extracting these characteristics, we could actually quantify what connects these research papers together and potentially help researchers/scientists better explore the current literature on COVID-19 by helping them discover additional research they never would have made a connection to otherwise! (since both t-SNE and K-Means cluster using far higher dimensional features than just common search words/tags that we typically utilize when search for articles of interest)

In [66]:
# section where we'll be doing LDA 
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# LDA improves upon the tf-idf model to make more than just a clustering - can now generate topics for each bpaper
vectorizers = []
for i in range(k): 
    vectorizers.append(CountVectorizer(min_df=7, max_df=0.8, stop_words='english', lowercase=True, token_pattern='[a-zA-Z\-][a-zA-Z\-]{2,}'))
In [67]:
processed_data = []

for i, vectorizer in enumerate(vectorizers): 
    try:
        processed_data.append(vectorizer.fit_transform(covid_df.loc[covid_df['cluster_assigments'] == i, 'parsed_text']))
    except Exception:
        print('not enough points in cluster')
        processed_data.append(None)
In [68]:
NUM_TOPICS = 22
lda_models = []
for i in range(k):
    model = LatentDirichletAllocation(n_components=NUM_TOPICS, max_iter=10, learning_method='online', verbose=False, random_state=2022)
    lda_models.append(model)
In [69]:
lda_models[0]
Out[69]:
LatentDirichletAllocation(batch_size=128, doc_topic_prior=None,
                          evaluate_every=-1, learning_decay=0.7,
                          learning_method='online', learning_offset=10.0,
                          max_doc_update_iter=100, max_iter=10,
                          mean_change_tol=0.001, n_components=22, n_jobs=None,
                          perp_tol=0.1, random_state=2022,
                          topic_word_prior=None, total_samples=1000000.0,
                          verbose=False)
In [70]:
clusters_lda = []

for i, model in enumerate(lda_models):
    if (i % 5 == 0 or i == len(lda_models)): 
        print(f'Processing cluster #{i}')
    if processed_data[i] != None:
        clusters_lda.append(model.fit_transform(processed_data[i]))
In [75]:
def get_topics(model, vectorizer, top_n = 3):
    curr = [] 
    all_keywords = []

    for i, topic in enumerate(model.components_):
        words = [(vectorizer.get_feature_names()[j], topic[j]) for j in topic.argsort()[:-top_n - 1:-1]]
        for word in words: 
            if word[0] not in curr:
                all_keywords.append(word)
                curr.append(word[0])

    all_keywords.sort(key=lambda x: x[1])
    all_keywords.reverse
    return_words = []
    for word in all_keywords:
        return_words.append(word[0])

    return return_words
In [76]:
all_topics = []

for i, model in enumerate(lda_models):
    if processed_data[i] != None: 
        all_topics.append(get_topics(model, vectorizers[i]))
In [77]:
all_topics[0][:10]
Out[77]:
['transmission',
 'use',
 'contact',
 'viral',
 'coronaviruses',
 'mortality',
 'protein',
 'sars',
 'sarscov',
 'baseline']
In [80]:
f = open('topics.txt', 'w')
count = 0

for topic_list in all_topics:
    if processed_data[count] != None:
        print(', '.join(topic_list) + '\n')
        f.write(', '.join(topic_list) )
    else: 
        f.write('Not enough instances \n')
        print(', '.join(topic_list) + '\n')
        f.write(', '.join(topics_list) + '\n')
    count += 1

f.close() 
transmission, use, contact, viral, coronaviruses, mortality, protein, sars, sarscov, baseline, group, stakeholders, emergency, nasal, log, rna, response, college, anxiety, body, bodies, wearing, students, facemasks, mental, knowledge, dead, facemask, hcq, psychiatric, spike, chinese, pregnant, women, social, information, risk, care, patient, data, study, pneumonia, clinical, virus, outbreak, number, epidemic

pdna, bcd, cds, mekk, horses, torc, baculovirus, platelets, platelet, prdx, brain, jnk, influenza, orf, cleavage, protease, stress, wnv, sting, identified, ire, ddx, upregulated, lncrnas, upr, synthesis, hbv, analysis, membranes, data, dna, trim, sarscov, prrsv, infected, tumor, mir, hiv, receptor, atg, treatment, entry, inflammatory, signaling, sirnas, genome, eif, disease, delivery, nsp, tlr, hcv, antiviral, rigi, cancer, membrane, sirna, autophagy, ifn

pip, pvm, def, replication, pvax, rvsv, ifni, tumor, receptors, fibrosis, airway, ova, ang, marrow, jev, antibodies, balt, igg, mir, isg, asc, bone, foxp, vsv, mcmv, ifnar, zikv, macrophages, stat, analysis, ace, tcell, tregs, ceacam, memory, treg, animals, genes, activation, significantly, strains, group, lung, vaccine, type, sarscov, liver, eae, lungs, responses, ccr, brain, ifn, mhv, day, demyelination, infected, cns

infected, amino, binding, sequence, sequences, groups, genes, recombination, isolates, cells, lumen, proventriculus, mexico, exposure, tubes, minutes, subgroup, plp, utr, beaudette, pcr, oligonucleotide, nsps, vics, phosphorylation, apoptosis, detection, ark, elisa, vero, samples, nsp, poultry, flocks, cell, expression, proteins, birds, vaccine, group

treatment, example, viral, veev, water, clinical, host, vhf, equi, coli, foals, sea, marine, parasites, patients, immune, increased, testing, cryptosporidium, sheep, cattle, dengue, bacteria, exposure, cells, respiratory, avian, genes, research, signs, lesions, blood, populations, food, people, fever, wildlife, samples, hosts, poultry, countries, data, birds

infection, etec, herds, cells, coli, austria, eimeria, pair, coccidiosis, recommended, social, operations, contact, veterinarians, pregnancy, risk, therapy, oocysts, igg, strain, diarrhoea, salmonella, strains, diarrhea, cows, fluid, bcov, cattle, intake, dairy, concentrations, milk, plasma, group, fed, serum, brd, virus, respiratory, brsv, parvum, cryptosporidium, treatment, colostrum

smart, medical, studies, water, care, security, states, laboratory, change, cases, surveillance, manifesting, travelassociated, oman, respirator, quarantine, respirators, migrant, migration, migrants, education, biological, students, training, sars, funding, mortality, chinese, outbreak, environmental, people, policy, ihr, regional, food, population, study, media, social, preparedness, ebola, pandemic, animal, china, emergency, services

kong, children, cells, vaccine, viral, sick, ambulance, direct, immunization, laiv, codon, calls, poland, casepatients, genotype, nhs, ifitm, iiv, type, hcp, care, season, cell, patients, antibodies, severe, detection, assay, human, strains, samples, antiviral, pneumonia, treatment, ili, positive, transmission, avian, vaccination, pandemic, surveillance

gene, rtpcr, realtime, nucleic, disease, primer, influenza, diagnosis, rna, prv, pestis, biothreat, parasites, fta, cruzi, parasite, sheep, semen, end, pairs, agents, blood, dengue, dogs, cation, bovine, species, denv, strains, sequence, signal, specimens, probe, extraction, rsv, surface, pathogens, min, target, test, diagnostic, probes, tests, respiratory, lamp

cell, infected, gene, virulent, papn, pad, pdcov, group, strains, cells, route, bhk, pgm, ion, potassium, coinfection, channel, bile, neutralizing, days, compounds, rabbit, vsv, peptides, peptide, rabbits, residues, swabs, pro, positive, disease, serum, feed, nsp, tgev, ifn, dpi, iga, sows, elisa, sequence, vero

plant, plants, expression, inactivated, dose, dna, groups, rbd, vlps, viruses, aav, psaa, pneumococcal, pneumoniae, hamsters, delivery, injection, env, transgene, niv, gag, dengue, vector, skin, denv, live, zikv, dogs, iga, vectors, rsv, neutralizing, diseases, group, surface, health, plasmid, new, sarscov, peptide, animals, hiv, igg, epitopes, mice, influenza, mucosal

com, respuesta, muy, riesgo, antibi, cultivo, cnicas, ntomas, neumon, respiratorias, influenza, debe, asma, lengua, traducci, siglas, entrada, control, care, vih, uni, health, dos, foi, pol, diarrea, adenovirus, salud, crisis, antiviral, personas, rotavirus, uma, infec, vrs, estudio, detecci

fever, rate, information, virus, sarscov, symptoms, quarantine, withdrawal, insufficiency, mimic, memory, ifn, autoantibodies, tcell, concentration, temperature, chip, mass, peptides, peptide, genotype, steroid, therapy, model, epidemic, taiwan, protein, singapore, respondents, mers, kong, hong, contact, treatment, period, hcws, workers, public, transmission, serum, chest, staff, care, days, lung, viral, samples

uncovered, nat, coast, nats, iav, sarcoidosis, sle, rvc, vasculitis, gastroenteritis, adem, encephalitis, stool, csf, syndrome, military, diarrhea, fever, days, renal, bal, blood, liver, syphilis, children, hsct, nasal, bacteria, recipients, therapy, transplant, cmv, bronchiolitis, groups, icu, immune, pulmonary, hadv, mortality, bacterial, treatment, analysis, infants, cells, pcr, copd, pneumoniae, influenza, lung, exacerbations, cap, group, rsv, samples, asthma, viruses, pneumonia

spread, interval, volatility, network, individuals, trading, guinea, leone, sierra, subjects, droplets, domain, velocity, passengers, air, masks, age, risk, intervention, forecasting, series, proposed, hospital, algorithm, cities, tree, posterior, information, probability, patients, pandemic, distribution, host, endemic, sars, genetic, china, virus, viral, surveillance, tracing, social, spatial, equilibrium, contacts, estimated, susceptible, estimates, contact, control, outbreak, nodes, networks

ddd, transfers, cdi, company, companies, events, set, states, network, security, infl, people, drug, waste, articles, ebola, market, women, disease, industry, masks, children, urban, laboratory, kong, review, mask, air, rate, hong, results, science, social, blood, patient, disaster, hcws, reported, tourism, ppe, clinical, government, control, management, model, hand, information, chinese, respondents, studies, risk, hospitals, participants, emergency, research, infection, china, cases, care, patients

time, saudi, cells, patients, pro, samples, camels, speed, monthly, humidity, classification, pigs, mva, llamas, mers, chadox, amplification, assays, rtlamp, expression, outbreak, risk, fusion, health, lung, nsp, vaccine, animals, rna, antibodies, rbd, mice, binding, protein, dpp, sarscov, camel, patient, transmission

gene, slcov, strains, phylogenetic, hku, protein, sequences, cells, rabies, sarscov, sequence, vein, ectoparasites, sample, amplexicaudatus, response, test, injection, blood, philippines, colonies, seroprevalence, adult, sites, china, cov, roosts, selection, conservation, tlr, buildings, cat, length, node, covs, expression, cell, immune, disease, transmission, samples

sites, peptides, ecor, eyfp, hsv, eef, replication, regions, pol, disorder, disordered, isg, usp, ubiquitin, transport, interaction, identified, yeast, sarscov, purification, golgi, surface, cov, interactions, set, number, human, site, lipid, mhv, peptide, phage, dna, membrane, sars, domain, viral, protease, activity, antibody, structure, expression, antibodies, residues, rna, cleavage, epitopes, pro, viruses, nsp, fusion

nucleotides, negativestrand, recombination, region, coli, benthamiana, ctv, grna, siv, genes, antibody, cdv, hev, zikv, hiv, acids, denv, class, dna, human, canine, dogs, pseudoknot, reference, amino, cpv, leader, coronavirus, secondary, methods, method, set, rnas, mhv, synthesis, conserved, structures, mutation, replication, frameshift, usage, expression, mutations, codon, translation, mrna, samples, orf, pcr, strain, proteins, structure, sequencing, frameshifting, cells, reads, strains

nes, syst, diagnostic, res, allergique, respiratoire, infections, mie, arabie, mers, saoudite, changes, techniques, tests, health, latex, fices, sars, toronto, partage, clostridium, eaux, classe, souche, particules, zone, chien, sions, nouveaun, alv, cells, macrophages, pathog, traitement, selles, nie, lymphop, maladie, chauvessouris, enfants, lenfant, sant, risque, toux, ine, cellules, diarrh, prot, patients, pid

inhibition, inhibitors, gave, treatment, drug, bases, mannich, hydroxyurea, cats, ribonucleotide, trafficking, reductase, flower, brazilian, tlr, pollen, cds, mefloquine, fcv, mnv, lonicera, japonica, surface, norovirus, propolis, dna, oral, patients, clpro, lycorine, glabra, binding, glycyrrhizin, licorice, metal, nmr, complexes, inhibitor, antimicrobial, bacteria, protease, antibacterial, drugs, cancer, plant, derivatives, mmol, pro, extracts, extract, assay, reaction, viral, antiviral, virus

npc, fcrn, tax, clec, htlv, interferon, ifnc, bacterial, bacteria, atp, kda, tetherin, ifnl, atg, potential, surface, ebv, golgi, use, rsv, group, tgev, ceacam, cholesterol, intestinal, membrane, tissue, autophagy, mhv, gene, ifitm, orf, viruses, epithelial, cancer, min, fusion, cultures, tumor, antibody, macrophages, response, hiv, entry, lung, prrsv, transfected, activity, antibodies, proteins, binding, levels, antiviral, nsp, replication, immune, ifn, infected

adem, avidity, hpv, della, del, una, trna, aptamers, cns, dna, chickens, hev, dog, feed, canine, pulmonary, zikv, coli, water, brain, und, significant, respiratory, surface, virus, rabbits, wind, rats, detection, animals, die, der, model, acid, cell, bacteria, levels, groups, response, air, particles, immune, het, gene, expression, genes, een, cells, serum, antibody, clinical, ang, disease, lung, group, dogs, van, ace, pigs, infection

cell, felv, test, blood, serum, cells, fipv, machine, progression, dystrophy, equilibrium, specimen, muscle, muscular, orange, peptides, pregnancy, queens, ifng, antigen, cases, species, human, genome, domestic, animals, data, lesions, shelter, expression, treatment, dogs, signs, samples, type, group, fiv, fcov, fip

Challenges/Obstacles

Overall I found this project super cool! Having something to build from the groundup is definitely intimidating at first, but when you get into the groove of it and start becoming super productive and familiar with your data and what you want out of it, you start doing some really cool things, and it's awesome to see that progression. That being said however, some of the challenges were:

  1. It was very hard to read the dataset in the first place; trying to determine what all these random folders and filenames mean is extremely confusing and intimidating at first, and tbh I almost switched my research topic within the first few days because I got so spooked by the data, haha. Glad I stuck with it though, got to do some really cool things with this dataset at the end of the day.

  2. Trying to figure out how break down the data into digestible bits! Beyond just debugging the NLTK parser and trying to make the process efficient so it didn't take forever, it was also a challenge just figuring out how to get the text data. The fact that everything was in json format in huge folders on my computer definitely didn't help either, and I guess it was a good way to force myself to learn what glob and os are used for in Python LOL - learning how to plot and visualize well in Python using seaborn was also definitely an accelerated learning process, but definitely feel super confident with data visualization now thanks to this project.

  3. Finally, just making sure that I applied LSA/K-means correctly so that I could get a nice cluster-coloring was a challenge, since you have to make sure all the dimensions line up as well and that you have the right inputs and outputs for every function and module you're using, so that was definitely a headache at times.

Conclusion / Potential Next Steps

Although there was definitely a lot of analysis and modeling done throughout the course of this project, there are definitely a ton of ways to take it further! My original plan was to run LDA (Latent Dirichlet Allocation) to extract the respective topics from each of the clusters, but finals szn (rip) took a heavy toll on the amount of time I had left for this project, so I didn't end up including it within the final project. However !! might do so in the future, so that's in the works (EDIT: Got around to finishing the part about LDA! You can see a pretty cool breakdown of the topics within the printed statement, and if you want a consolidated list it's available in the topics.txt file! Now might try to do some more stuff with Plotly/Dash) - otherwise, hope you found the project interesting! Definitely was super interesting for me, and grateful that I had this opportunity to work with a super cool dataset relevant to a current ongoing pandemic (not many times you can say that your school work is relevant to your everyday life, haha); thanks for an awesome semester!